In this hands-on, we walk through an end-to-end Affymetrix microarray differential expression workflow using Bioconductor packages. The data analyzed here is a typical clinical microarray data set that compares inflamed and non-inflamed colon tissue in two disease subtypes. For each disease, the differential gene expression between inflamed- and non-inflamed colon tissue will be analyzed. We will start from the raw data CEL files, show how to import them into a Bioconductor ExpressionSet, perform quality control and normalization and finally differential gene expression (DE) analysis, followed by some enrichment analysis.
The workflow and all the necessary tools are wrapped in a Bioconductor package called maEndToEnd (https://www.bioconductor.org/packages/release/workflows/html/maEndToEnd.html)
To install the packages:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("maEndToEnd")
library("maEndToEnd")
List of packages required for the workflow
# General Bioconductor Packages
library(Biobase)
library(oligoClasses)
#Annotation and data import packages
library(ArrayExpress)
library(pd.hugene.1.0.st.v1)
library(hugene10sttranscriptcluster.db)
#Quality control and pre-processing packages
library(oligo)
library(arrayQualityMetrics)
#Analysis and statistics packages
library(limma)
library(topGO)
library(ReactomePA)
library(clusterProfiler)
#Plotting and color options packages
library(gplots)
library(ggplot2)
library(geneplotter)
library(pheatmap)
The data set we are going to use comes from a study (Palmieri et al., 2015), which explore the differences in gene expression in inflamed and non-inflamed tissue. 14 patients suffering from Ulcerative colitis (UC) and 15 patients with Crohn’s disease (CD) were tested, and from each patient inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy (See table below schema). This is a typical clinical data set consisting of 58 arrays in total. Our aim is to analyze differential expression (DE) between the tissues.
The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the measured probe intensities. The data we use have been deposited at ArrayExpress and have the accession code E-MTAB-2967
# Make a working directory
raw_data_dir <- tempdir()
if (!dir.exists(raw_data_dir)) {
dir.create(raw_data_dir)
}
# Get data with getAE function
anno_AE <- getAE("E-MTAB-2967", path = raw_data_dir, type = "raw")
# if getAE function does not work, please manually download the data from ArrayExpress at this link
# https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-2967
raw_data_dir <- "data/microarray/E-MTAB-2967/"
We import the SDRF file with the read.delim function from the raw data folder in order to obtain the sample annotation.
The sample names are given in the column Array.Data.File of the SDRF data table and will be used as rownames for the SDRF file.
We turn the SDRF table into an AnnotatedDataFrame from the Biobase package that we will need later to create an ExpressionSet for our data.
sdrf_location <- file.path(raw_data_dir, "E-MTAB-2967.sdrf.txt")
SDRF <- read.delim(sdrf_location)
rownames(SDRF) <- SDRF$Array.Data.File
SDRF <- AnnotatedDataFrame(SDRF)
Then we need to read the CEL files and create an ExpressionSet.
# We use the function read.celfiles from the oligo package4 to import the files:
raw_data <- oligo::read.celfiles(filenames = file.path(raw_data_dir,
SDRF$Array.Data.File),
verbose = FALSE, phenoData = SDRF)
The columns of interest for us are the following:
identifiers of the individuals, “Source.Name”, “Characteristics.individual.”
disease of the individual, “Factor.Value.disease.”
mucosa type, “Factor.Value.phenotype.”
# subselect those columns and have a look at the pheno data using pData function (Biobase package)
Biobase::pData(raw_data) <- Biobase::pData(raw_data)[, c("Source.Name",
"Characteristics.individual.",
"Factor.Value.disease.",
"Factor.Value.phenotype.")]
head(Biobase::pData(raw_data))
## Source.Name Characteristics.individual. Factor.Value.disease. Factor.Value.phenotype.
## 164_I_.CEL 164_I 164 Crohn's disease non-inflamed colonic mucosa
## 164_II.CEL 164_II 164 Crohn's disease inflamed colonic mucosa
## 183_I.CEL 183_I 183 Crohn's disease non-inflamed colonic mucosa
## 183_II.CEL 183_II 183 Crohn's disease inflamed colonic mucosa
## 2114_I.CEL 2114_I 2114 Crohn's disease non-inflamed colonic mucosa
## 2114_II.CEL 2114_II 2114 Crohn's disease inflamed colonic mucosa
The first step after the initial data import is the quality control of the data. Here we check for outlaiers and we check whether the data clusters are as expected, e.g. grouped by the experimental conditions. The expression intensity values are in the assayData sub-object “exprs” and can be accessed by the exprs(raw_data) function (Biobase package).
# Access raw data
Biobase::exprs(raw_data)[1:5, 1:5]
# Make a boxplot
oligo::boxplot(raw_data, target = "core", main = "Boxplot of log2-intensitites for the raw data")
# Make a histogram
oligo::hist(raw_data, target = "core", main = "Hist of log2-intensitites for the raw data")
# Perform PCA analysis (on log transformed data) and plot it
exp_raw <- log2(Biobase::exprs(raw_data))
PCA_raw <- prcomp(t(exp_raw), scale. = FALSE)
dataGG <- data.frame(PC1 = PCA_raw$x[,1], PC2 = PCA_raw$x[,2],
Disease = pData(raw_data)$Factor.Value.disease.,
Phenotype = pData(raw_data)$Factor.Value.phenotype.,
Individual = pData(raw_data)$Characteristics.individual.)
ggplot(dataGG, aes(PC1, PC2, colour = Phenotype)) +
geom_point(size = 3) +
stat_ellipse()
ggplot(dataGG, aes(PC1, PC2, colour = Disease)) +
geom_point(size = 3) +
stat_ellipse()
## 164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL
## 1 4496 5310 4492 4511 2872
## 2 181 280 137 101 91
## 3 4556 5104 4379 4608 2972
## 4 167 217 99 79 82
## 5 89 110 69 58 47
# Optional step
# Array Quality Metrics stats (comprehensive quality metrics analysis. Useful to identify outlaiers)
# It takes about 10 minutes (take a break)
dir.create("E-MTAB-2967/QualiMetrics")
arrayQualityMetrics(expressionset = raw_data,
outdir = "E-MTAB-2967/QualiMetrics/",
force = TRUE, do.logtransform = TRUE,
intgroup = c("Factor.Value.disease.", "Factor.Value.phenotype."))
# Open the index.html file in the folder and inspect the plots
[index]("data/microarray/E-MTAB-2967/QualiMetrics/index.html")
Now, we can apply the RMA algorithm to our data in order to background-correct, normalize and summarize:
# Just one command to perform Background correcting, Normalization, Expression
norm_data <- oligo::rma(raw_data, target = "core")
## Background correcting
## Normalizing
## Calculating Expression
Let’s re-evaluate the data after normalization.
# Make a boxplot
oligo::boxplot(norm_data, target = "core", main = "Boxplot of log2-intensitites for the norm data")
# Make a histogram
oligo::hist(norm_data, target = "core", main = "Hist of log2-intensitites for the norm data")
### PCA plot
exp_norm_data<- Biobase::exprs(norm_data)
PCA <- prcomp(t(exp_norm_data), scale = FALSE)
dataGG_norm <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
Disease = pData(norm_data)$Factor.Value.disease.,
Phenotype = pData(norm_data)$Factor.Value.phenotype.,
Individual = pData(norm_data)$Characteristics.individual.)
ggplot(dataGG_norm, aes(PC1, PC2, colour = Phenotype)) +
geom_point(size = 3) +
stat_ellipse()
ggplot(dataGG_norm, aes(PC1, PC2, colour = Disease)) +
geom_point(size = 3) +
stat_ellipse()
# Heat-map hierarchical clustering (Optional)
dists <- as.matrix(dist(t(exp_norm_data), method = "manhattan"))
rownames(dists) <- row.names(pData(norm_data))
diag(dists) <- NA
pheatmap(dists, legend = TRUE,
treeheight_row = 0,
legend_breaks = c(min(dists, na.rm = TRUE),
max(dists, na.rm = TRUE)),
legend_labels = (c("small distance", "large distance")),
main = "Clustering heatmap for the calibrated samples")
We now filter out lowly expressed genes. Microarray data commonly show a large number of probes in the background intensity range. These probes also do not change much across arrays. Hence they combine a low variance with a low intensity. Thus, they could end up being detected as differentially expressed although they are barely above the “detection” limit and are not very informative in general.
We will perform a “soft” intensity based filtering here, since this is recommended by the limma user guide (a package we will use below for the differential expression analysis).
data_norm_medians <- rowMedians(Biobase::exprs(norm_data))
hist_res <- hist(data_norm_medians, 100, freq = FALSE,
main = "Histogram of the median intensities",
xlab = "Median intensities")
# Set a cutoff
man_threshold <- 4
abline(v = man_threshold, col = "coral4", lwd = 2)
samples_cutoff <- 14
idx_man_threshold <- apply(Biobase::exprs(norm_data), 1,
function(x){
sum(x > man_threshold) >= samples_cutoff})
table(idx_man_threshold)
data_manfiltered <- subset(norm_data, idx_man_threshold)
## idx_man_threshold
## FALSE TRUE
## 10493 22804
In order to analyse which genes are differentially expressed between inflamed and non-inflamed tissue, we have to fit a linear model to our expression data. Linear models are the “workhorse” for the analysis of experimental data. They can be used to analyse complex designs.
For building our linear model, we must think about which experimental variables we want to consider. As we want to find differential expression between non-inflamed and inflamed tissue, in principle, those are the only two variables we would have to consider.
individual <- as.character(Biobase::pData(data_manfiltered)$Characteristics.individual.)
tissue <- str_replace_all(Biobase::pData(data_manfiltered)$Factor.Value.phenotype., " ", "_")
tissue <- ifelse(tissue == "non-inflamed_colonic_mucosa","nI", "I")
disease <- str_replace_all(Biobase::pData(data_manfiltered)$Factor.Value.disease.," ", "_")
disease <- ifelse(str_detect(Biobase::pData(data_manfiltered)$Factor.Value.disease., "Crohn"), "CD", "UC")
# Let consider both "CD" & "UC"
i_CD <- individual[disease == "CD"]
design_palmieri_CD <- model.matrix(~ 0 + tissue[disease == "CD"] + i_CD)
colnames(design_palmieri_CD)[1:2] <- c("I", "nI")
rownames(design_palmieri_CD) <- i_CD
i_UC <- individual[disease == "UC"]
design_palmieri_UC <- model.matrix(~ 0 + tissue[disease == "UC"] + i_UC )
colnames(design_palmieri_UC)[1:2] <- c("I", "nI")
rownames(design_palmieri_UC) <- i_UC
Before heading on to find all differentially expressed genes for both diseases, we will first have a look at how this works in principle for one gene. We will fit the linear model for one gene and run a t-test in order to see whether the gene is differentially expressed or not.
tissue_CD <- tissue[disease == "CD"]
crat_expr <- Biobase::exprs(data_manfiltered)["8164535", disease == "CD"]
crat_data <- as.data.frame(crat_expr)
colnames(crat_data)[1] <- "org_value"
crat_data <- mutate(crat_data, individual = i_CD, tissue_CD)
crat_data$tissue_CD <- factor(crat_data$tissue_CD, levels = c("nI", "I"))
ggplot(data = crat_data, aes(x = tissue_CD, y = org_value,
group = individual, color = individual)) +
geom_line() +
ggtitle("Expression changes for the CRAT gene")
We see that overall, this gene is expressed less in inflamed tissue. We also see that the absolute expression intensities vary greatly between patients.
# Fitting Linear Model
crat_coef <- lmFit(data_manfiltered[,disease == "CD"],
design = design_palmieri_CD)$coefficients["8164535",]
# In order to now obtain the expression values fitted by the model, we multiply the design matrix with this vector of coefficients crat_coef:
crat_fitted <- design_palmieri_CD %*% crat_coef
rownames(crat_fitted) <- names(crat_expr)
colnames(crat_fitted) <- "fitted_value"
crat_data$fitted_value <- crat_fitted
ggplot(data = crat_data, aes(x = tissue_CD, y = fitted_value,
group = individual, color = individual)) +
geom_line() +
ggtitle("Fitted expression changes for the CRAT gene")
Now we conduct the t-test on the linear model in order to find out whether the difference between non-inflamed and inflamed tissue differs significantly from 0:
crat_noninflamed <- na.exclude(crat_data$org_value[tissue == "nI"])
crat_inflamed <- na.exclude(crat_data$org_value[tissue == "I"])
res_t <- t.test(crat_noninflamed ,crat_inflamed , paired = TRUE)
res_t
##
## Paired t-test
##
## data: crat_noninflamed and crat_inflamed
## t = 6.8493, df = 14, p-value = 7.943e-06
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2919389 0.5581289
## sample estimates:
## mean difference
## 0.4250339
We get a low p-value close to 0 and thus can conclude that the CRAT gene is differentially expressed between noninflamed and inflamed tissue.
contrast_matrix_CD <- makeContrasts(I-nI, levels = design_palmieri_CD)
palmieri_fit_CD <- eBayes(contrasts.fit(lmFit(data_manfiltered[,disease == "CD"],
design = design_palmieri_CD),
contrast_matrix_CD))
table_CD <- topTable(palmieri_fit_CD, number = Inf)
head(table_CD)
hist(table_CD$P.Value, col = brewer.pal(3, name = "Set2")[1],
main = "inflamed vs non-inflamed - Crohn’s disease", xlab = "p-values")
nrow(subset(table_CD, P.Value < 0.001))
volcanoplot(palmieri_fit_CD, coef = 1L, style = "p-value", highlight = 20,
xlab = "Log2 Fold Change", ylab = NULL, pch=16, cex=0.35)
## logFC AveExpr t P.Value adj.P.Val B
## 7928695 -0.5989663 7.739356 -7.070252 1.349648e-06 0.02143364 5.325176
## 8123695 -0.4855473 6.876406 -6.329429 5.749965e-06 0.02143364 4.050318
## 8164535 -0.4250339 6.732139 -6.245285 6.810636e-06 0.02143364 3.899844
## 8009746 -0.5182389 5.561874 -6.216695 7.215416e-06 0.02143364 3.848459
## 7952249 -0.8484355 5.605752 -6.207947 7.344159e-06 0.02143364 3.832712
## 8105348 0.8311803 5.301460 6.078746 9.547745e-06 0.02143364 3.598694
## [1] 630
Before we continue with functional analysis we need to add annotation information to the transcript cluster identifiers stored in the featureData of our ExpressionSet.
# We use the function select from AnnotationDbi to query the gene symbols and associated short descriptions for the transcript clusters.
anno_data <- AnnotationDbi::select(hugene10sttranscriptcluster.db,
keys = (featureNames(data_manfiltered)),
columns = c("SYMBOL", "GENENAME"),
keytype = "PROBEID")
# we filtered out the probes that do not map to a gene, i.e. that do not have a gene symbol assigned.
anno_data <- subset(anno_data, !is.na(SYMBOL))
When we have a large list of genes of interest, such as a list of differentially expressed genes, how do we extract biological meaning from it?
One way to do so is to perform functional enrichment analysis. This method consists of applying statistical tests to verify if genes of interest are more often associated with certain biological functions than what would be expected in a random set of genes. To perform functional enrichment analysis, we need to have: * A set with all the genes to consider in the analysis: universe set (which must contain the study set) * A set of genes of interest (e.g., differentially expressed genes): study set
# Get the universe set and remove duplicated entries
universe <- anno_data[!duplicated(anno_data$SYMBOL), 2]
# Get a set of genes of interest, Differentially expressed with p.val < 0.1. Then annotate and remove duplicates
DE_genes_CD <- subset(table_CD, adj.P.Val < 0.1)
anno_DE_genes_CD <- AnnotationDbi::select(hugene10sttranscriptcluster.db,
keys = (rownames(DE_genes_CD)),
columns = c("SYMBOL", "GENENAME"),
keytype = "PROBEID")
anno_DE_genes_CD <- subset(anno_DE_genes_CD, !is.na(SYMBOL))
symb <- anno_DE_genes_CD[!duplicated(anno_DE_genes_CD$SYMBOL), 2]
ego <- enrichGO(gene = symb,
universe = universe,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
goplot(ego)
barplot(ego, showCategory=20)
dotplot(ego, showCategory=20)
As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record of this as it will help to trace down what has happened in case an R script ceases to work or gives different results because the functions have been changed in a newer version of one of your packages. By including it at the bottom of a script, your reports will become more reproducible.
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_2.17 fontawesome_0.4.0 captioner_2.2.3
## [4] bookdown_0.29 knitr_1.40 maEndToEnd_2.16.0
## [7] enrichplot_1.16.2 Rgraphviz_2.40.0 openxlsx_4.2.5
## [10] genefilter_1.78.0 matrixStats_0.62.0 stringr_1.4.1
## [13] tidyr_1.2.1 dplyr_1.0.10 RColorBrewer_1.1-3
## [16] pheatmap_1.0.12 geneplotter_1.74.0 annotate_1.74.0
## [19] XML_3.99-0.11 lattice_0.20-45 ggplot2_3.3.6
## [22] gplots_3.1.3 clusterProfiler_4.4.4 ReactomePA_1.40.0
## [25] topGO_2.48.0 SparseM_1.81 GO.db_3.15.0
## [28] graph_1.74.0 limma_3.52.4 arrayQualityMetrics_3.52.0
## [31] hugene10sttranscriptcluster.db_8.8.0 org.Hs.eg.db_3.15.0 AnnotationDbi_1.58.0
## [34] pd.hugene.1.0.st.v1_3.14.1 DBI_1.1.3 oligo_1.60.0
## [37] RSQLite_2.2.18 Biostrings_2.64.1 GenomeInfoDb_1.32.4
## [40] XVector_0.36.0 IRanges_2.30.1 S4Vectors_0.34.0
## [43] ArrayExpress_1.56.0 oligoClasses_1.58.0 Biobase_2.56.0
## [46] BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 tidyselect_1.2.0 htmlwidgets_1.5.4
## [4] beadarray_2.46.0 BiocParallel_1.30.4 scatterpie_0.1.8
## [7] munsell_0.5.0 codetools_0.2-18 preprocessCore_1.58.0
## [10] interp_1.1-3 withr_2.5.0 colorspace_2.0-3
## [13] GOSemSim_2.22.0 highr_0.9 rstudioapi_0.14
## [16] setRNG_2022.4-1 DOSE_3.22.1 labeling_0.4.2
## [19] MatrixGenerics_1.8.1 GenomeInfoDbData_1.2.8 hwriter_1.3.2.1
## [22] polyclip_1.10-0 bit64_4.0.5 farver_2.1.1
## [25] downloader_0.4 vctrs_0.4.2 treeio_1.20.2
## [28] generics_0.1.3 xfun_0.34 affxparser_1.68.1
## [31] R6_2.5.1 illuminaio_0.38.0 graphlayouts_0.8.2
## [34] gridSVG_1.7-4 bitops_1.0-7 cachem_1.0.6
## [37] fgsea_1.22.0 gridGraphics_0.5-1 DelayedArray_0.22.0
## [40] assertthat_0.2.1 scales_1.2.1 ggraph_2.1.0
## [43] nnet_7.3-18 gtable_0.3.1 affy_1.74.0
## [46] tidygraph_1.2.2 rlang_1.0.6 systemfonts_1.0.4
## [49] splines_4.2.0 lazyeval_0.2.2 hexbin_1.28.2
## [52] checkmate_2.1.0 BiocManager_1.30.18 yaml_2.3.6
## [55] reshape2_1.4.4 backports_1.4.1 qvalue_2.28.0
## [58] Hmisc_4.7-1 tools_4.2.0 ggplotify_0.1.0
## [61] affyio_1.66.0 jquerylib_0.1.4 ff_4.0.7
## [64] Rcpp_1.0.9 plyr_1.8.7 base64enc_0.1-3
## [67] zlibbioc_1.42.0 purrr_0.3.5 RCurl_1.98-1.9
## [70] rpart_4.1.16 openssl_2.0.4 deldir_1.0-6
## [73] viridis_0.6.2 SummarizedExperiment_1.26.1 ggrepel_0.9.1
## [76] cluster_2.1.4 magrittr_2.0.3 data.table_1.14.4
## [79] DO.db_2.9 reactome.db_1.81.0 patchwork_1.1.2
## [82] evaluate_0.17 xtable_1.8-4 jpeg_0.1-9
## [85] gcrma_2.68.0 gridExtra_2.3 compiler_4.2.0
## [88] tibble_3.1.8 KernSmooth_2.23-20 shadowtext_0.1.2
## [91] crayon_1.5.2 htmltools_0.5.3 ggfun_0.0.7
## [94] Formula_1.2-4 aplot_0.1.8 tweenr_2.0.2
## [97] rappdirs_0.3.3 MASS_7.3-58.1 Matrix_1.5-1
## [100] cli_3.4.1 vsn_3.64.0 parallel_4.2.0
## [103] igraph_1.3.5 GenomicRanges_1.48.0 pkgconfig_2.0.3
## [106] foreign_0.8-83 foreach_1.5.2 ggtree_3.4.4
## [109] svglite_2.1.0 bslib_0.4.0 BeadDataPackR_1.48.0
## [112] affyPLM_1.72.0 yulab.utils_0.0.5 digest_0.6.30
## [115] base64_2.0.1 fastmatch_1.1-3 tidytree_0.4.1
## [118] htmlTable_2.4.1 gtools_3.9.3 graphite_1.42.0
## [121] lifecycle_1.0.3 nlme_3.1-160 jsonlite_1.8.3
## [124] viridisLite_0.4.1 askpass_1.1 fansi_1.0.3
## [127] pillar_1.8.1 KEGGREST_1.36.3 fastmap_1.1.0
## [130] httr_1.4.4 survival_3.4-0 glue_1.6.2
## [133] zip_2.2.1 png_0.1-7 iterators_1.0.14
## [136] bit_4.0.4 sass_0.4.2 ggforce_0.4.1
## [139] stringi_1.7.8 blob_1.2.3 caTools_1.18.2
## [142] latticeExtra_0.6-30 memoise_2.0.1 ape_5.6-2
Below a list of references and articles.
Affymetrix GeneChip microarray preprocessing for multivariate analyses